Abstract

Common Representation Learning (CRL), wherein different descriptions (or views)
of the data are embedded in a common subspace, has recently received a lot of attention.
Two popular paradigms here are Canonical Correlation Analysis (CCA) based
approaches and Autoencoder (AE) based approaches. CCA based approaches learn a
joint representation by maximizing correlation of the views when projected to the
common subspace. AE based methods learn a common representation by minimizing the
error of reconstructing the two views. Each of these approaches has its own
advantages and disadvantages. For example, while CCA based approaches outperform AE
based approaches for the task of transfer learning, they are not as scalable as the latter.
In this work we propose an AE based approach called Correlational Neural Network
(CorrNet), that explicitly maximizes correlation among the views when projected to the
common subspace. Through a series of experiments, we demonstrate that the proposed
CorrNet is better than the above mentioned approaches with respect to its ability to
learn correlated common representations. Further, we employ CorrNet for several cross
language tasks and show that the representations learned using CorrNet perform better
than the ones learned using other state of the art approaches.

1 Introduction 

In several real world applications, the data contains more than one view. For example,
a movie clip has three views (of different modalities): audio, video and text/subtitles.

^ Work done while the first author was at IIT Madras and IBM Research India. 



However, all the views may not always be available. For example, for many movie clips,
audio and video may be available but subtitles may not be available. Recently there has
been a lot of interest in learning a common representation for multiple views of the
data (Ngiam et al., 2011; Klementiev et al., 2012; Chandar et al., 2013, 2014; Andrew
et al., 2013; Hermann & Blunsom, 2014b; Wang et al., 2015), which can be useful in
several downstream applications when some of the views are missing. We consider
four applications to motivate the importance of learning common representations: (i) 
reconstruction of a missing view, (ii) transfer learning, (iii) matching corresponding 
items across views, and (iv) improving single view performance by using data from 
other views. 

In the first application, the learned common representations can be used to train a 
model to reconstruct all the views of the data (akin to autoencoders reconstructing the 
input view from a hidden representation). Such a model would allow us to reconstruct 
the subtitles even when only audio/video is available. Now, as an example of transfer 
learning, consider the case where a profanity detector trained on movie subtitles needs 
to detect profanities in a movie clip for which only video is available. If a common 
representation is available for the different views, then such detectors/classifiers can be 
trained by computing this common representation from the relevant view (subtitles, in 
the above example). At test time, a common representation can again be computed from 
the available view (video, in this case) and this representation can be fed to the trained 
model for prediction. Third, consider the case where items from one view (say, names 
written using the script of one language) need to be matched to their corresponding 
items from another view (names written using the script of another language). One way 
of doing this is to project items from the two views to a common subspace such that the 
common representations of corresponding items from the two views are correlated. We 
can then match items across views based on the correlation between their projections. 
Finally, consider the case where we are interested in learning word representations for a 
language. If we have access to translations of these words in another language then these 
translations can provide some context for disambiguation which can lead to learning 
better word representations. In other words, jointly learning representations for a word
in language $L_1$ and its translation in language $L_2$ can lead to better word representations
in $L_1$ (see the bigram similarity experiments).

Having motivated the importance of Common Representation Learning (CRL), we
now formally define this task. Consider some data $Z = \{z_i\}_{i=1}^{N}$ which has two views:
$X$ and $Y$. Each data point $z_i$ can be represented as a concatenation of these two views:
$z_i = (x_i, y_i)$, where $x_i \in \mathbb{R}^{d_1}$ and $y_i \in \mathbb{R}^{d_2}$. In this work, we are interested in learning
two functions, $h_X$ and $h_Y$, such that $h_X(x_i) \in \mathbb{R}^{k}$ and $h_Y(y_i) \in \mathbb{R}^{k}$ are projections of
$x_i$ and $y_i$ respectively in a common subspace ($\mathbb{R}^{k}$) such that for a given pair $x_i, y_i$:

1. $h_X(x_i)$ and $h_Y(y_i)$ should be highly correlated.

2. It should be possible to reconstruct $y_i$ from $x_i$ (through $h_X(x_i)$) and vice versa.

Canonical Correlation Analysis (CCA) (Hotelling, 1936) is a commonly used tool
for learning such common representations for two-view data (Udupa & Khapra, 2010;
Dhillon et al., 2011). By definition, CCA aims to produce correlated common
representations, but it suffers from some drawbacks. First, it is not easily scalable to very
large datasets. Of course, there are some approaches which try to make CCA scalable
(for example, Lu & Foster, 2014), but such scalability comes at the cost of performance.
Further, since CCA does not explicitly focus on reconstruction, reconstructing
one view from the other might result in low quality reconstruction. Finally, CCA cannot
benefit from additional non-parallel, single-view data. This puts it at a severe disadvantage
in several real world situations, where in addition to some parallel two-view data,
abundant single view data is available for one or both views.

Recently, Multimodal Autoencoders (MAEs) (Ngiam et al., 2011) have been proposed
to learn a common representation for two views/modalities. The idea in MAE
is to train an autoencoder to perform two kinds of reconstruction. Given any one view,
the model learns both self-reconstruction and cross-reconstruction (reconstruction of
the other view). This makes the learnt representations predictive of each other.
However, it should be noted that the MAE does not get any explicit learning signal
encouraging it to share the capacity of its common hidden layer between the views. In
other words, it could develop units whose activation is dominated by a single view. This
makes the MAE unsuitable for transfer learning, since the views are not guaranteed to
be projected to a common subspace. This is indeed verified by the results reported in
Ngiam et al. (2011), where they show that CCA performs better than deep MAE for the
task of transfer learning.

These two approaches have complementary characteristics. On one hand, we have 
CCA and its variants which aim to produce correlated common representations but 
lack reconstruction capabilities. On the other hand, we have MAE which aims to do 
self-reconstruction and cross-reconstruction but does not guarantee correlated common 
representations. In this paper, we propose Correlational Neural Network (CorrNet) as 
a method for learning common representations which combines the advantages of the 
two approaches described above. The main characteristics of the proposed method can 
be summarized as follows: 


• It allows for self/cross reconstruction. Thus, unlike CCA (and like MAE) it has
predictive capabilities. This can be useful in applications where a missing view
needs to be reconstructed from an existing view.

• Unlike MAE (and like CCA) the training objective used in CorrNet ensures that
the common representations of the two views are correlated. This is particularly
useful in applications where we need to match items from one view to their
corresponding items in the other view.

• CorrNet can be trained using Gradient Descent based optimization methods.
Particularly, when dealing with large high dimensional data, one can use Stochastic
Gradient Descent with mini-batches. Thus, unlike CCA (and like MAE) it is easy
to scale CorrNet.

• The procedure used for training CorrNet can be easily modified to benefit from
additional single view data. This makes CorrNet useful in many real world
applications where additional single view data is available.

We evaluate CorrNet using four different experimental setups. First, we use the
MNIST hand-written digit recognition dataset to compare CorrNet with other state of
the art CRL approaches. In particular, we evaluate its (i) ability to self/cross reconstruct,
(ii) ability to produce correlated common representations and (iii) usefulness in transfer
learning. In this setup, we use the left and right halves of the digit images as two
views. Next, we use CorrNet for a transfer learning task where the two views of data
come from two different languages. Specifically, we use CorrNet to project parallel
documents in two languages to a common subspace. We then employ these common
representations for the task of cross language document classification (transfer learning)
and show that they perform better than the representations learned using other state
of the art approaches. Third, we use CorrNet for the task of transliteration equivalence
where the aim is to match a name written using the script of one language (first view) to
the same name written using the script of another language (second view). Here again,
we demonstrate that with its ability to produce better correlated common representations,
CorrNet performs better than CCA and MAE. Finally, we employ CorrNet for
a bigram similarity task and show that jointly learning word representations for two
languages (two views) leads to better word representations. Specifically, representations
learnt using CorrNet help to improve the performance of a bigram similarity task.
We would like to emphasize that unlike other models which have been tested mostly
in only one of these scenarios, we demonstrate the effectiveness of CorrNet in all these
different scenarios.

The remainder of this paper is organized as follows. In section 2 we describe the
architecture of CorrNet and outline a training procedure for learning its parameters. In
section 3 we propose a deep variant of CorrNet. In section 4 we briefly discuss some
related models for learning common representations. In section 5 we present experiments
to analyze the characteristics of CorrNet and compare it with CCA, KCCA and
MAE. In the subsequent sections we empirically compare Deep CorrNet with some other
deep CRL methods and report results obtained by using CorrNet for the
tasks of cross language document classification, transliteration equivalence detection
and bigram similarity respectively. Finally, we present concluding remarks
and highlight possible future work.


2 Correlational Neural Network 

As described earlier, our aim is to learn a common representation from two views of 
the same data such that: (i) any single view can be reconstructed from the common 
representation, (ii) a single view can be predicted from the representation of another 
view and (iii) like CCA, the representations learned for the two views are correlated. 
The first goal above can be achieved by a conventional autoencoder. The first and 
second can be achieved together by a Multimodal autoencoder but it is not guaranteed 
to project the two views to a common subspace. We propose a variant of autoencoders 
which can work with two views of the data, while being explicitly trained to achieve all 
the above goals. In the following sub-sections, we describe our model and the training 
procedure. 
2.1 Model


We start by proposing a neural network architecture which contains three layers: an
input layer, a hidden layer and an output layer. Just as in a conventional single view
autoencoder, the input and output layers have the same number of units, whereas the
hidden layer can have a different number of units. For illustration, we consider a
two-view input $z = (x, y)$. For all the discussions, $[x, y]$ denotes a concatenated vector of
size $d_1 + d_2$.

Given $z = (x, y)$, the hidden layer computes an encoded representation as follows:

$$h(z) = f(Wx + Vy + b)$$

where $W$ is a $k \times d_1$ projection matrix, $V$ is a $k \times d_2$ projection matrix and $b$ is a $k \times 1$
bias vector. Function $f$ can be any non-linear activation function, for example sigmoid
or tanh. The output layer then tries to reconstruct $z$ from this hidden representation by
computing

$$z' = g([W'h(z), V'h(z)] + b')$$

where $W'$ is a $d_1 \times k$ reconstruction matrix, $V'$ is a $d_2 \times k$ reconstruction matrix and $b'$
is a $(d_1 + d_2) \times 1$ output bias vector. Vector $z'$ is the reconstruction of $z$. Function $g$ can
be any activation function. This architecture is illustrated in Figure 1. The parameters
of the model are $\theta = \{W, V, W', V', b, b'\}$. In the next sub-section we outline a
procedure for learning these parameters.
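The encoder and decoder equations above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' Theano implementation: the choice of sigmoid for both $f$ and $g$, the Gaussian initialisation, and the names `init_params`, `encode`, `decode`, `W_out`, `V_out` and `b_out` are assumptions made for the example.

```python
import numpy as np


def sigmoid(a):
    """Element-wise logistic function, used here for both f and g (an assumption)."""
    return 1.0 / (1.0 + np.exp(-a))


def init_params(d1, d2, k, seed=0):
    """Randomly initialise theta = {W, V, W', V', b, b'} (hypothetical scheme)."""
    rng = np.random.default_rng(seed)
    return {
        "W": rng.normal(0.0, 0.1, (k, d1)),      # k x d1 projection for view x
        "V": rng.normal(0.0, 0.1, (k, d2)),      # k x d2 projection for view y
        "b": np.zeros(k),                        # hidden bias
        "W_out": rng.normal(0.0, 0.1, (d1, k)),  # W' in the text
        "V_out": rng.normal(0.0, 0.1, (d2, k)),  # V' in the text
        "b_out": np.zeros(d1 + d2),              # b' in the text
    }


def encode(p, x, y):
    """h(z) = f(Wx + Vy + b) for a batch: x is (n, d1), y is (n, d2)."""
    return sigmoid(x @ p["W"].T + y @ p["V"].T + p["b"])


def decode(p, h):
    """z' = g([W'h, V'h] + b') for a batch of hidden vectors h of shape (n, k)."""
    pre = np.concatenate([h @ p["W_out"].T, h @ p["V_out"].T], axis=1)
    return sigmoid(pre + p["b_out"])


# Usage on random data with two 392-dimensional views and k = 50 hidden units.
rng = np.random.default_rng(1)
x = rng.random((4, 392))
y = rng.random((4, 392))
params = init_params(d1=392, d2=392, k=50)
h = encode(params, x, y)
z_rec = decode(params, h)
```

Rows are data points here, so the matrices are applied as `x @ W.T`, which matches $Wx$ for column vectors.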



Figure 1: Correlational Neural Network 


2.2 Training 

Restating our goals more formally, given two-view data $Z = \{z_i\}_{i=1}^{N} = \{(x_i, y_i)\}_{i=1}^{N}$,
for each instance $(x_i, y_i)$, we would like to:

• Minimize the self-reconstruction error, i.e., minimize the error in reconstructing
$x_i$ from $x_i$ and $y_i$ from $y_i$.

• Minimize the cross-reconstruction error, i.e., minimize the error in reconstructing
$x_i$ from $y_i$ and $y_i$ from $x_i$.

• Maximize the correlation between the hidden representations of both views.

We achieve this by finding the parameters $\theta = \{W, V, W', V', b, b'\}$ which minimize
the following objective function:

$$J_Z(\theta) = \sum_{i=1}^{N} \Big( L(z_i, g(h(z_i))) + L(z_i, g(h(x_i))) + L(z_i, g(h(y_i))) \Big) - \lambda\, \mathrm{corr}(h(X), h(Y))$$

$$\mathrm{corr}(h(X), h(Y)) = \frac{\sum_{i=1}^{N} (h(x_i) - \overline{h(X)})(h(y_i) - \overline{h(Y)})}{\sqrt{\sum_{i=1}^{N} (h(x_i) - \overline{h(X)})^2 \, \sum_{i=1}^{N} (h(y_i) - \overline{h(Y)})^2}}$$

where $L$ is the reconstruction error, $\lambda$ is the scaling parameter to scale the fourth term
with respect to the remaining three terms, $\overline{h(X)}$ is the mean vector for the hidden
representations of the first view and $\overline{h(Y)}$ is the mean vector for the hidden representations
of the second view. If all dimensions in the input data take binary values then we use
cross-entropy as the reconstruction error; otherwise we use squared error loss as the
reconstruction error. For simplicity, we use the shorthands $h(x_i)$ and $h(y_i)$ to denote the
representations $h((x_i, 0))$ and $h((0, y_i))$ that are based only on a single view.^1 For each
data point with two views $x$ and $y$, $h(x_i)$ just means that we are computing the hidden
representation using only the $x$-view. In other words, in the equation $h(z) = f(Wx + Vy + b)$,
we set $y = 0$, so $h(x) = f(Wx + b)$. $h(x) = h(x, 0)$ is not a choice per se, but a
notation we are defining. $h(z)$, $h(x)$ and $h(y)$ are certainly not guaranteed to be identical,
though training will push them towards each other because of the various reconstruction
terms. The correlation term in the objective function is calculated considering the
hidden representation as a random vector.

In words, the objective function decomposes as follows. The first term is the usual
autoencoder objective function which helps in learning meaningful hidden representations.
The second term ensures that both views can be predicted from the shared
representation of the first view alone. The third term ensures that both views can be predicted
from the shared representation of the second view alone. The fourth term interacts with
the other objectives to make sure that the hidden representations are highly correlated,
so as to encourage the hidden units of the representation to be shared between views.
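The four terms can be made concrete with a small NumPy sketch of the objective evaluated on one batch. It is an illustration under stated assumptions, not the authors' code: squared error is used for $L$, sigmoid for $f$ and $g$, and the correlation term is taken as the sum over hidden units of the per-unit Pearson correlation (with a small `eps` for numerical stability). All function names are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def hidden(W, V, b, x, y):
    """h(z) = f(Wx + Vy + b); pass zeros for a view to get h(x) or h(y)."""
    return sigmoid(x @ W.T + y @ V.T + b)


def reconstruct(W_out, V_out, b_out, h):
    """z' = g([W'h, V'h] + b')."""
    return sigmoid(np.concatenate([h @ W_out.T, h @ V_out.T], axis=1) + b_out)


def squared_error(target, output):
    """Reconstruction error L, summed over the batch (assumed squared error)."""
    return np.sum((target - output) ** 2)


def correlation(hx, hy, eps=1e-8):
    """Per-unit Pearson correlation between h(x) and h(y), summed over units."""
    hx_c = hx - hx.mean(axis=0)
    hy_c = hy - hy.mean(axis=0)
    num = np.sum(hx_c * hy_c, axis=0)
    den = np.sqrt(np.sum(hx_c ** 2, axis=0) * np.sum(hy_c ** 2, axis=0)) + eps
    return np.sum(num / den)


def corrnet_objective(theta, x, y, lam):
    """J_Z(theta): three reconstruction terms minus lambda times the correlation."""
    W, V, b, W_out, V_out, b_out = theta
    z = np.concatenate([x, y], axis=1)
    h_z = hidden(W, V, b, x, y)                 # both views present
    h_x = hidden(W, V, b, x, np.zeros_like(y))  # only the x-view
    h_y = hidden(W, V, b, np.zeros_like(x), y)  # only the y-view
    loss = sum(squared_error(z, reconstruct(W_out, V_out, b_out, h))
               for h in (h_z, h_x, h_y))
    return loss - lam * correlation(h_x, h_y)


# Usage on a random batch with d1 = d2 = 6 and k = 3.
rng = np.random.default_rng(0)
x, y = rng.random((20, 6)), rng.random((20, 6))
theta = (rng.normal(0, 0.1, (3, 6)), rng.normal(0, 0.1, (3, 6)), np.zeros(3),
         rng.normal(0, 0.1, (6, 3)), rng.normal(0, 0.1, (6, 3)), np.zeros(12))
J = corrnet_objective(theta, x, y, lam=1.0)
```

In practice the gradient of this quantity with respect to the parameters would be obtained by automatic differentiation; the sketch only evaluates the objective.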

We can use stochastic gradient descent (SGD) to find the optimal parameters. For
all our experiments, we used mini-batch SGD. The fourth term in the objective function
is then approximated based on the statistics of a mini-batch. Approximating second
order statistics using mini-batches for training was also used successfully in the batch
normalization training method of Ioffe & Szegedy (2015).

The model has four hyperparameters: (i) the number of units in its hidden layer,
(ii) $\lambda$, (iii) mini-batch size, and (iv) the SGD learning rate. The first hyperparameter
is dependent on the specific task at hand and can be tuned using a validation set (exactly
as is done by other competing algorithms). The second hyperparameter is only
to ensure that the correlation term in the objective function has the same range as the
reconstruction errors. This is again easy to approximate based on the given data. The
third hyperparameter determines how well the mini-batch statistics approximate the
correlation of the entire dataset, so larger mini-batches are preferred over smaller
mini-batches. The final hyperparameter, the learning rate, is common to all neural
network based approaches.

^1 They represent the generic functions $h_X$ and $h_Y$ mentioned in the introduction.

Once the parameters are learned, we can use the CorrNet to compute representations
of views that can potentially generalize across views. Specifically, given a new
data instance for which only one view is available, we can compute its corresponding
representation ($h(x)$ if $x$ is observed or $h(y)$ if $y$ is observed) and use it as the new data
representation.

2.3 Using additional single view data 

In practice, it is often the case that we have abundant single view data and comparatively
little two-view data. For example, in the context of text documents from two
languages ($X$ and $Y$), typically the amount of monolingual (single view) data available
in each language is much larger than parallel (two-view) data available between $X$ and
$Y$. Given the abundance of such single view data, it is desirable to exploit it in order to
improve the learned representation. CorrNet can achieve this, by using the single view
data to improve the self-reconstruction error as explained below.

Consider the case where, in addition to the data $Z = \{z_i\}_{i=1}^{N} = \{(x_i, y_i)\}_{i=1}^{N}$,
we also have access to the single view data $\mathcal{X} = \{x_i\}_{i=N+1}^{N_1}$ and $\mathcal{Y} = \{y_i\}_{i=N+1}^{N_2}$.
Now, during training, in addition to using $Z$ as explained before, we also use $\mathcal{X}$ and $\mathcal{Y}$
by suitably modifying the objective function so that it matches that of a conventional
autoencoder. Specifically, when we have only $x_i$, then we could try to minimize

$$J_{\mathcal{X}}(\theta) = \sum_{i=N+1}^{N_1} L(x_i, g(h(x_i)))$$

and similarly for $y_i$.

In all our experiments, when we have access to all three types of data (i.e., $\mathcal{X}$, $\mathcal{Y}$ and
$Z$), we construct 3 sets of mini-batches by sampling data from $\mathcal{X}$, $\mathcal{Y}$ and $Z$ respectively.
We then feed these mini-batches in random order to the model and perform a gradient
update based on the corresponding objective function.
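The interleaving of the three kinds of mini-batches can be sketched as below. This is one possible reading of the procedure, not the authors' code: batches are formed by slicing each set in order (the text says "sampling"), and the function names and the string tags `"Z"`, `"X"` and `"Y"` are hypothetical. The tag tells the training loop which objective to use for the update.

```python
import random


def make_batches(data, batch_size, tag):
    """Cut one data set into mini-batches, each labelled with its source set."""
    return [(tag, data[i:i + batch_size]) for i in range(0, len(data), batch_size)]


def epoch_schedule(paired, only_x, only_y, batch_size, seed=0):
    """Return all mini-batches from Z, X and Y in a single random order."""
    batches = (make_batches(paired, batch_size, "Z")
               + make_batches(only_x, batch_size, "X")
               + make_batches(only_y, batch_size, "Y"))
    random.Random(seed).shuffle(batches)
    return batches


# Usage with placeholder items: 10 paired instances, 6 x-only and 4 y-only.
schedule = epoch_schedule(list(range(10)), list(range(6)), list(range(4)), batch_size=2)
for tag, batch in schedule:
    # A real loop would take one gradient step here on J_Z for "Z" batches
    # and on the single view autoencoder objective for "X" or "Y" batches.
    pass
```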

3 Deep Correlational Neural Networks 

An obvious extension for CorrNets is to allow for multiple hidden layers. The main
motivation for having such Deep Correlational Neural Networks is that a better correlation
between the views of the data might be achievable by more non-linear representations.
We use the following procedure to train a Deep CorrNet.

1. Train a shallow CorrNet with the given data (see step-1 in Figure 2). At the end
of this step, we have learned the parameters $W$, $V$ and $b$.

2. Modify the CorrNet model such that the first input view connects to a hidden
layer using weights $W$ and bias $b$. Similarly connect the second view to a hidden
layer using weights $V$ and bias $b$. We have now decoupled the common hidden
layer for each view (see step-2 in Figure 2).

3. Add a new common hidden layer which takes its input from the hidden layers
created at step 2. We now have a CorrNet which is one layer deeper (see step-3
in Figure 2).

4. Train the new Deep CorrNet on the same data.

5. Repeat steps 2, 3 and 4, for as many hidden layers as required.
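The encoder side of the stacking steps above can be sketched as follows. This is only an illustration under assumptions: sigmoid activations, random weights standing in for trained ones, and hypothetical function names. The decoder side and the training step are omitted because the text does not spell them out.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def decoupled_layer(x, y, W, V, b):
    """Step 2: each view gets its own hidden layer, reusing W, V and the bias b."""
    return sigmoid(x @ W.T + b), sigmoid(y @ V.T + b)


def deep_encode(x, y, lower_layers, top):
    """Pass both views through the decoupled layers, then the new common layer."""
    for W, V, b in lower_layers:
        x, y = decoupled_layer(x, y, W, V, b)
    W_top, V_top, b_top = top
    return sigmoid(x @ W_top.T + y @ V_top.T + b_top)  # step 3: common layer


# Usage: one decoupled layer (392 -> 100 per view) and a 50-unit common layer.
rng = np.random.default_rng(0)
x, y = rng.random((4, 392)), rng.random((4, 392))
layer1 = (rng.normal(0, 0.1, (100, 392)), rng.normal(0, 0.1, (100, 392)), np.zeros(100))
top = (rng.normal(0, 0.1, (50, 100)), rng.normal(0, 0.1, (50, 100)), np.zeros(50))
h_deep = deep_encode(x, y, [layer1], top)
```

Repeating steps 2 to 4 corresponds to moving the current `top` into `lower_layers` and adding a fresh common layer.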






[Figure 2 panels: step-1, step-2, step-3]

Figure 2: Stacking CorrNet to create Deep Correlational Neural Network.


We would like to point out that we could have followed the procedure described in
Chandar et al. (2013) for training Deep CorrNet. In Chandar et al. (2013), they learn
a deep representation for each view separately and use it along with a shallow CorrNet
to learn a common representation. However, feeding non-linear deep representations to
a shallow CorrNet makes it harder to train the CorrNet. Also, we chose not to use the
deep training procedure described in Ngiam et al. (2011) since the objective function
used by them during pre-training and training is different. Specifically, during
pre-training the objective is to minimize self reconstruction error whereas during training
the objective is to minimize both self and cross reconstruction error. In contrast, in the
stacking procedure outlined above, the objectives during training and pre-training are
aligned.

Our current training procedure for Deep CorrNet is similar to greedy layerwise
pre-training of deep autoencoders. We believe that this procedure is more faithful to the
global training objective of CorrNet and it works well. We do not have strong empirical
evidence that this is superior to other methods such as the ones described in Chandar
et al. (2013) and Ngiam et al. (2011). When less parallel data is available, the method
described in Chandar et al. (2013) makes more sense; each method has its own
advantages. We leave a detailed comparison of these different alternatives for Deep
CorrNet as future work.
4 Related Models


In this section, we describe other related models for common representation learning. 
We restrict our discussion to CCA based methods and Neural Network based methods 
only. 

Canonical Correlation Analysis (CCA) (Hotelling, 1936) and its variants, such as
regularized CCA (Vinod, 1976; Nielsen et al., 1998; Cruz-Cano & Lee, 2014), are the
de-facto approaches used for learning a common representation for two different views
in the literature (Udupa & Khapra, 2010; Dhillon et al., 2011). Kernel CCA (Akaho,
2001; Hardoon et al., 2004), which is another variant of CCA, uses the standard kernel
trick to find pairs of non-linear projections of the two views. Deep CCA, a deep
version of CCA, is also introduced in Andrew et al. (2013). One issue with CCA is
that it is not easily scalable. Even though there are several works on scaling CCA (see
Lu & Foster, 2014), they are all approximations to CCA and hence lead to a decrease
in the performance. Also, it is not trivial to extend CCA to multiple views. However,
there is some recent work along this line (Tenenhaus & Tenenhaus, 2011; Luo
et al., 2015) which requires complex computations. Lastly, conventional CCA based
models can work only with parallel data. However, in real life situations, parallel data
is costly when compared to single view data. The inability of CCA to leverage such
single view data acts as a drawback in many real world applications. Representation
Constrained CCA (RCCCA) (Mishra, 2009) is one CCA based model which can benefit from
both single view data and multiview data. It effectively uses a weighted combination
of PCA (for single view) and CCA (for two views) by minimizing self-reconstruction
errors and maximizing correlation. CorrNet, in contrast, minimizes both self and cross
reconstruction error while maximizing correlation. RCCCA can also be considered as
a linear version of DCCAE proposed in Wang et al. (2015).


Hsieh (2000) is one of the earliest Neural Network based models for nonlinear CCA.
This method uses three feedforward neural networks. The first neural network is a
double-barreled architecture where two networks project the views to a single unit such
that the projections are maximally correlated. This network is first trained to maximize
the correlation. Then the inverse mapping for each view is learnt from the corresponding
canonical covariate representation by minimizing the reconstruction error. There are
clear differences between this Neural CCA model and CorrNet. First, CorrNet is a single
neural network which is trained with a single objective function while Neural CCA
has three networks trained with different objective functions. Second, Neural CCA does
only correlation maximization and self-reconstruction, whereas CorrNet does correlation
maximization, self-reconstruction and cross-reconstruction, all at the same time.

Multimodal Autoencoder (MAE) (Ngiam et al., 2011) is another Neural Network
based CRL approach. Even though the architecture of MAE is similar to that of CorrNet,
there are clear differences in the training procedure used by the two. Firstly, MAE
only aims to minimize the following three errors: (i) error in reconstructing $z_i$ from $x_i$
($E_1$), (ii) error in reconstructing $z_i$ from $y_i$ ($E_2$) and (iii) error in reconstructing $z_i$ from
$z_i$ ($E_3$). More specifically, unlike the fourth term in our objective function, the objective
function used by MAE does not contain any term which forces the network to learn
correlated common representations. Secondly, there is a difference in the manner in which
these terms are considered during training. Unlike CorrNet, MAE only considers one
of the above terms at a time. In other words, given an instance $z_i = (x_i, y_i)$ it first tries
to minimize $E_1$ and updates the parameters accordingly. It then tries to minimize $E_2$
followed by $E_3$. Empirically, we observed that a training procedure which considers all
three loss terms together performs better than the one which considers them separately
(refer to Section 5.5).

Deep Canonical Correlation Analysis (DCCA) (Andrew et al., 2013) is a recently
proposed Neural Network approach for CCA. DCCA employs two deep networks, one
per view. The model is trained in such a way that the final layer projections of the data
in both the views are maximally correlated. DCCA maximizes only correlation whereas
CorrNet maximizes both correlation and reconstruction ability. Deep Canonically
Correlated Auto Encoders (DCCAE) (Wang et al., 2015) (developed in parallel with our
work) is an extension of DCCA which considers self reconstruction and correlation.
Unlike CorrNet it does not consider cross-reconstruction.


5 Analysis of Correlational Neural Networks 


In this section, we perform a set of experiments to compare CorrNet, CCA (Hotelling,
1936), Kernel CCA (KCCA) (Akaho, 2001) and MAE (Ngiam et al., 2011) based on:


• ability to reconstruct a view from itself 


• ability to reconstruct one view given the other 


• ability to learn correlated common representations for the two views 


• usefulness of the learned common representations in transfer learning. 


For CCA, we used a C++ library called dlib (King, 2009). For KCCA, we used an
implementation provided by the authors of (Arora & Livescu, 2012). We implemented
CorrNet and MAE using Theano (Bergstra et al., 2010).


5.1 Data Description 

We used the standard MNIST handwritten digits image dataset for all our experiments.
This data consists of 60,000 train images and 10,000 test images. Each image is a 28 ×
28 matrix of pixels, with each pixel taking one of 256 grayscale values. We treated the
left half of the image as one view and the right half of the image as the other view. Thus
each view contains 14 × 28 = 392 dimensions. We split the train images into two sets.
The first set contains 50,000 images and is used for training. The second set contains
10,000 images and is used as a validation set for tuning the hyper-parameters of the four
models described above.
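Constructing the two views amounts to slicing each image down the middle. A minimal NumPy sketch is shown below, using random arrays in place of the real MNIST images; the function name `split_views` is hypothetical.

```python
import numpy as np


def split_views(images):
    """Split (n, 28, 28) images into left and right halves of 14 x 28 = 392 pixels each."""
    n = images.shape[0]
    left = images[:, :, :14].reshape(n, -1)   # columns 0..13 of every row
    right = images[:, :, 14:].reshape(n, -1)  # columns 14..27 of every row
    return left, right


# Usage on five synthetic 28 x 28 grayscale images with values in 0..255.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28))
left, right = split_views(images)
```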

5.2 Performance of Self and Cross Reconstruction 

Among the four models listed above, only CorrNet and MAE have been explicitly
trained to construct a view from itself as well as from the other view. So, in this
sub-section, we consider only these two models. Table 1 shows the Mean Squared Errors
(MSEs) for self and cross reconstruction when the left half of the image is used as input.
Model     MSE for self reconstruction   MSE for cross reconstruction
-------   ---------------------------   ----------------------------
CorrNet   3.6                           4.3
MAE       2.1                           4.2

Table 1: Mean Squared Error for CorrNet and MAE for self reconstruction and cross
reconstruction

The above table suggests that CorrNet has a higher self reconstruction error and
almost the same cross reconstruction error as that of MAE. This is because unlike MAE,
in CorrNet, the emphasis is on maximizing the correlation between the common
representations of the two views. This goal, captured by the fourth term in the objective
function, interferes with the goal of self reconstruction. As we will see in
the next sub-section, the embeddings learnt by CorrNet for the two views are better
correlated even though the self-reconstruction error is sacrificed in the process.

Figure 3 shows the reconstruction of the right half from the left half for a few sample
images. The figure reiterates our point that both CorrNet and MAE are equally good at
cross reconstruction.

Figure 3: Reconstruction of right half of the image given the left half. First block shows
the original images, second block shows images where the right half is reconstructed
by CorrNet and the third block shows images where the right half is reconstructed by
MAE.


5.3 Correlation between representations of two views 

As mentioned above, in CorrNet we emphasize learning highly correlated
representations for the two views. To show that this is indeed the case, we follow Andrew et al.
(2013) and calculate the total/sum correlation captured in the 50 dimensions of the
common representations learnt by the four models described above. The training, validation
and test sets used for this experiment were as described in section 5.1. The results are
reported in Table 2.
Model     Sum Correlation
-------   ---------------
CCA       17.05
KCCA      30.58
MAE       24.40
CorrNet   45.47

Table 2: Sum/Total correlation captured in the 50 dimensions of the common
representations learned by different models using MNIST data.

The total correlation captured in the 50 dimensions learnt by CorrNet is clearly 
better than that of the other models. 

Next, we check whether this is indeed the case when we change the number of
dimensions. For this, we varied the number of dimensions from 5 to 80 and plotted the
sum correlation for each model (see Figure 4). For all the models, we tuned the
hyper-parameters for dim = 50 and used the same hyper-parameters for all dimensions.



Figure 4: Sum/Total correlation as a function of the number of dimensions in the
common representations learned by different models using MNIST data.

Again, we see that CorrNet clearly outperforms the other models. CorrNet thus 
achieves its primary goal of producing correlated embeddings with the aim of assisting 
transfer learning. 

5.4 Transfer Learning across views 

To demonstrate transfer learning, we take the task of predicting digits from only one half
of the image. We first learn a common representation for the two views using 50,000
images from the MNIST training data. For each training instance, we take only one half
of the image and compute its 50 dimensional common representation using one of the
models described above. We then train a classifier using this representation. For each
test instance, we consider only the other half of the image and compute its common
representation. We then feed this representation to the classifier for prediction. We use
the linear SVM implementation provided by (Pedregosa et al., 2011) as the classifier for
all our experiments. For all the models considered in this experiment, representation
learning is done using 50,000 train images and the best hyperparameters are chosen
using the 10,000 images from the validation set. With the chosen model, we report
5-fold cross validation accuracy using 10,000 images available in the standard test set
of MNIST data. We report accuracy for two settings: (i) Left to Right (training on left
view, testing on right view) and (ii) Right to Left (training on right view, testing on left
view).


Model          Left to Right    Right to Left
CCA            65.73            65.44
KCCA           68.1             75.71
MAE            64.14            68.88
CorrNet        77.05            78.81
Single view    81.62            80.06


Table 3: Transfer learning accuracy using the representations learned using different 
models on the MNIST dataset. 

Single view corresponds to the classifier trained and tested on the same view. This is
the upper bound for the performance of any transfer learning algorithm. Once again,
we see that CorrNet performs significantly better than the other models. To verify
that this holds with less data, we decreased the data used for learning the common
representation to 10,000 images. The results reported in Table 4 show that even with
less data, CorrNet performs better than the other models.


Model          Left to Right    Right to Left
CCA            66.13            66.71
KCCA           70.68            70.83
MAE            68.69            72.54
CorrNet        76.6             79.51
Single view    81.62            80.06


Table 4: Transfer learning accuracy using the representations learned using different 
models trained with 10000 instances from the MNIST dataset. 

5.5 Relation with MAE 

On the face of it, it may seem that CorrNet and MAE differ only in their objective
functions. Specifically, if we remove the last correlation term from the objective
function of CorrNet then it would become equivalent to MAE. To verify this, we
conducted experiments using both MAE and CorrNet without the last term (say
CorrNet(123)). When using SGD to train the networks, we found that the performance
is almost the same. However, when we use a more advanced optimization technique such
as RMSProp, CorrNet(123) starts performing better than MAE. The results are reported
in Table 5.

Model           Optimization    Left to Right    Right to Left
MAE             SGD             63.9             67.98
CorrNet(123)    SGD             63.89            67.93
MAE             RMSProp         64.14            68.88
CorrNet(123)    RMSProp         67.82            72.13


Table 5: Results for transfer learning across views


This experiment sheds some light on why CorrNet is better than MAE. Even though
the objectives of MAE and CorrNet(123) are the same, MAE optimizes the objective in a
stochastic way (one term at a time), which adds more noise. CorrNet(123) performs
better since it works on the combined objective function rather than on this
stochastic version of it.

5.6 Analysis of Loss Terms 

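The objective function defined in Section 2 has the following four terms:

• $L_1 = \sum_{i=1}^{N} L(z_i, g(h(z_i)))$

• $L_2 = \sum_{i=1}^{N} L(z_i, g(h(x_i)))$

• $L_3 = \sum_{i=1}^{N} L(z_i, g(h(y_i)))$

• $L_4 = \lambda \, \mathrm{corr}(h(X), h(Y))$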

In this section, we analyze the importance of each of these terms in the loss function.
For this, during training, we consider different loss functions which contain different
combinations of these terms. In addition, we consider four more loss terms for our
analysis.

• $L_5 = \sum_{i=1}^{N} L(y_i, g(h(x_i)))$

• $L_6 = \sum_{i=1}^{N} L(x_i, g(h(y_i)))$

• $L_7 = \sum_{i=1}^{N} L(x_i, g(h(x_i)))$

• $L_8 = \sum_{i=1}^{N} L(y_i, g(h(y_i)))$

where L5 and L6 essentially capture the loss in reconstructing only one view (say, x_i)
from the other view (y_i) while L7 and L8 capture the loss in self reconstruction.
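
To make the loss terms concrete, the following is a minimal sketch and not the
implementation used in the experiments. It assumes a squared-error reconstruction
loss, that an absent view is fed to the encoder as a vector of zeros, that L2 and L3
denote reconstruction of the full input z_i = (x_i, y_i) from x_i alone and from y_i
alone, and that the encoder `h` and decoder `g` are supplied by the caller. The name
`loss_terms` and the callable signatures are hypothetical.

```python
import numpy as np

def loss_terms(x, y, h, g, lam=1.0, eps=1e-12):
    """Compute L1..L8 for a batch of paired views.

    x: (n, dx) array, y: (n, dy) array.
    h(x, y) -> hidden representation of shape (n, k)   (assumed signature)
    g(hidden) -> (x_hat, y_hat) reconstructions          (assumed signature)
    """
    zeros_x, zeros_y = np.zeros_like(x), np.zeros_like(y)

    def sq(a, b):
        # Squared-error loss summed over the batch (an assumption about L).
        return float(((a - b) ** 2).sum())

    hz = h(x, y)              # both views present
    hx = h(x, zeros_y)        # only x present
    hy = h(zeros_x, y)        # only y present
    xz, yz = g(hz)
    xx, yx = g(hx)
    xy, yy = g(hy)

    # Sum over hidden dimensions of the correlation between h(X) and h(Y).
    hx_c, hy_c = hx - hx.mean(axis=0), hy - hy.mean(axis=0)
    corr = ((hx_c * hy_c).sum(axis=0)
            / (np.sqrt((hx_c ** 2).sum(axis=0) * (hy_c ** 2).sum(axis=0)) + eps)).sum()

    return {
        "L1": sq(x, xz) + sq(y, yz),   # reconstruct z from z
        "L2": sq(x, xx) + sq(y, yx),   # reconstruct z from x
        "L3": sq(x, xy) + sq(y, yy),   # reconstruct z from y
        "L4": lam * float(corr),       # scaled correlation term
        "L5": sq(y, yx),               # reconstruct y from x
        "L6": sq(x, xy),               # reconstruct x from y
        "L7": sq(x, xx),               # reconstruct x from x
        "L8": sq(y, yy),               # reconstruct y from y
    }
```

With a squared-error loss, L2 decomposes as L7 + L5 and L3 as L6 + L8, which makes
explicit that L5 to L8 isolate the cross-view and self-reconstruction parts. How the
correlation term enters the minimized objective (its sign) is fixed where the
objective is defined and is deliberately not encoded in this sketch.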

For this, we first learn common representations using the different loss functions
listed in the first column of Table 6. We then repeat the transfer learning experiments
using the common representations learned from each of these models. For example, the
sixth row in the table shows the results when the following loss function is used for
learning the common representations:

$J_Z(\theta) = L_1 + L_2 + L_3 + L_4$

which is the same as that used in CorrNet.

Loss function used for training    Left to Right    Right to Left
L1                                 24.59            22.56
L1 + L4                            65.9             67.54
L2 + L3                            71.54            75
L2 + L3 + L4                       76.54            80.57
L1 + L2 + L3                       67.82            72.13
L1 + L2 + L3 + L4                  77.05            78.81
L5 + L6                            35.62            32.26
L5 + L6 + L4                       62.05            63.49
L7 + L8                            10.26            10.33
L7 + L8 + L4                       73.03            76.08


Table 6: Comparison of the performance of transfer learning with representations
learned using different loss functions.

Each even numbered row in the table reports the performance when the correlation
term (L4) was used in addition to the other terms in the row immediately before
it. A pair-wise comparison of the numbers in each even numbered row with the row
immediately above it suggests that the correlation term (L4) in the loss function
clearly produces representations which lead to better transfer learning.


6 Experiments using Deep Correlational Neural Network 


In this section, we evaluate the performance of the deep extension of CorrNet. Having
already compared with MAE in the previous section, we focus our evaluation here on a
comparison with DCCA (Andrew et al., 2013). All the models were trained using 10,000
images from the MNIST training dataset and we computed the sum correlation and
transfer learning accuracy for each of these models. For transfer learning, we use the
linear SVM implementation provided by Pedregosa et al. (2011) for all our experiments
and do 5-fold cross validation using 10,000 images from the MNIST test data. We report
results for two settings: (i) Left to Right (training on left view, testing on right
view) and (ii) Right to Left (training on right view, testing on left view). These
results are summarized in Table 7. In this table, model-x-y means a model with x units
in the first hidden layer and y units in the second hidden layer. For example,
CorrNet-500-300-50 is a Deep CorrNet with three hidden layers containing 500, 300 and
50 units respectively. The third layer containing 50 units is used as the common
representation.


Model                 Sum Correlation    Left to Right    Right to Left
CorrNet-500-50        47.21              77.68            77.95
DCCA-500-50           33.00              66.41            64.65
CorrNet-500-300-50    45.634             80.46            80.47
DCCA-500-500-50       33.77              70.06            72.43


Table 7: Comparison of sum correlation and transfer learning performance of different
deep models

Both the Deep CorrNets (CorrNet-500-50 and CorrNet-500-300-50) clearly perform
better than the corresponding DCCA. We notice that for both the transfer learning tasks,
the 3-layered CorrNet (CorrNet-500-300-50) performs better than the 2-layered CorrNet
(CorrNet-500-50) but the sum correlation of the 2-layered CorrNet is better than
that of the 3-layered CorrNet.


7 Cross Language Document Classification 


In this section, we will learn bilingual word representations using CorrNet and use these
representations for the task of cross language document classification. We experiment
with three language pairs and show that our approach achieves state-of-the-art
performance.

Before we discuss bilingual word representations, let us consider the task of learning
word representations for a single language. Consider a language X containing d words
in its vocabulary. We represent a sentence in this language using a binary bag-of-words
representation x. Specifically, each dimension x_i is set to 1 if the i-th vocabulary
word is present in the sentence, 0 otherwise. We wish to learn a k-dimensional
vectorial representation of each word in the vocabulary from a training set of
sentence bags-of-words.

We propose to achieve this by using a CorrNet which works with only a single view
of the data (see section 2.3). Effectively, one can view a CorrNet as encoding an input
bag-of-words x as the sum of the columns in W corresponding to the words that are
present in x, followed by a non-linearity. Thus, we can view W as a matrix whose
columns act as vector representations (embeddings) for each word.
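
The equivalence between multiplying W by a binary bag-of-words and summing the
columns of W for the words present can be sketched as follows. This is an
illustration only: the sigmoid non-linearity, the optional bias `b` and the function
name `encode_bow` are assumptions rather than details taken from the model definition.

```python
import numpy as np

def encode_bow(W, x, b=None):
    """Encode a binary bag-of-words x (length d) with a k x d matrix W.

    Because x is binary, W @ x equals the sum of the columns of W whose
    words are present in the sentence.
    """
    x = np.asarray(x, dtype=float)
    pre = W @ x if b is None else W @ x + b
    return 1.0 / (1.0 + np.exp(-pre))    # sigmoid, chosen here as an assumption

def word_embedding(W, word_index):
    """Column of W used as the embedding of the word at word_index."""
    return W[:, word_index]
```

For a sentence containing words 0, 2 and 5, the pre-activation is exactly
`word_embedding(W, 0) + word_embedding(W, 2) + word_embedding(W, 5)`.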

Let's now assume that for each sentence bag-of-words x in some source language
X, we have an associated bag-of-words y for this sentence translated in some target
language Y by a human expert. Assuming we have a training set of such (x, y) pairs,
we'd like to learn representations in both languages that are aligned, such that pairs
of translated words have similar representations. The CorrNet can allow us to achieve
this. Indeed, it will effectively learn word representations (the columns of W and V)
that are not only informative about the words present in sentences of each language, but
will also ensure that the representation space is aligned between languages, as required
by the cross-view reconstruction terms and the correlation term.

Note that, since the binary bags-of-words are very high-dimensional (the
dimensionality corresponds to the size of the vocabulary, which is typically large),
reconstructing each binary bag-of-words will be slow. Since we will later be training
on millions of sentences, training on each individual sentence bag-of-words will be
expensive. Thus, we propose a simple trick, which exploits the bag-of-words structure
of the input. Assuming we are performing mini-batch training (where a mini-batch
contains a list of the bags-of-words of adjacent sentences), we simply propose to merge
the bags-of-words of the mini-batch into a single bag-of-words and perform an update
based on that merged bag-of-words. The resulting effect is that each update is as
efficient as in stochastic gradient descent, but the number of updates per training
epoch is divided by the mini-batch size. As we'll see in the experimental section, this
trick produces good word representations, while sufficiently reducing training time.
We note that, additionally, we could have used the stochastic approach proposed by
Dauphin et al. (2011) for reconstructing binary bag-of-words representations of
documents, to further improve the efficiency of training. They use importance sampling
to avoid reconstructing the whole V-dimensional input vector.
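
The merging trick can be sketched as below. This is a hedged illustration: it assumes
that the merged bag-of-words stays binary (a word is marked present if it occurs in
any sentence of the mini-batch), which the text does not state explicitly, and the
names `merge_bags` and `merged_minibatches` are hypothetical.

```python
import numpy as np

def merge_bags(batch):
    """Merge binary bags-of-words (n_sentences x d) into one binary bag-of-words."""
    batch = np.asarray(batch)
    # A word is present in the merged bag if it is present in any sentence.
    return (batch.sum(axis=0) > 0).astype(batch.dtype)

def merged_minibatches(X, Y, batch_size):
    """Yield one merged (x, y) training pair per mini-batch of adjacent sentence pairs."""
    for start in range(0, len(X), batch_size):
        yield (merge_bags(X[start:start + batch_size]),
               merge_bags(Y[start:start + batch_size]))
```

With N sentence pairs and a mini-batch size of m, this yields about N / m training
instances per epoch, each costing one forward and backward pass.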


7.1 Document representations 

Once we learn the language specific word representation matrices W and V as
described above, we can use them to construct document representations, by using their
columns as word vector representations. Given a document d, we represent it as the
tf-idf weighted sum of its words' representations: $\psi_X(d) = W \, \text{tf-idf}(d)$
for language X and $\psi_Y(d) = V \, \text{tf-idf}(d)$ for language Y, where tf-idf(d)
is the tf-idf weight vector of document d.

We use the document representations thus obtained to train our document classifiers,
in the cross-lingual document classification task described in Section 7.3.
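
The document representation above is a single matrix-vector product. The sketch below
illustrates it under stated assumptions: the tf-idf variant (raw term count times
log(N / df)) is one common choice and is not specified in the text, and the function
names are hypothetical.

```python
import numpy as np

def tfidf_vector(term_counts, doc_freq, n_docs):
    """tf-idf weights for one document (assumed variant: count * log(N / df))."""
    term_counts = np.asarray(term_counts, dtype=float)
    doc_freq = np.asarray(doc_freq, dtype=float)
    # Guard against division by zero for words never seen in the collection.
    idf = np.log(n_docs / np.maximum(doc_freq, 1.0))
    return term_counts * idf

def document_representation(W, tfidf):
    """psi(d) = W tf-idf(d): tf-idf weighted sum of the word vectors (columns of W)."""
    return W @ np.asarray(tfidf, dtype=float)
```

The same function applies to either language: pass W for documents in language X and
V for documents in language Y, so that both land in the same k-dimensional space.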


7.2 Related Work on Multilingual Word Representations 

Recent work that has considered the problem of learning bilingual representations of
words usually has relied on word-level alignments. Klementiev et al. (2012) propose
to train simultaneously two neural network language models, along with a
regularization term that encourages pairs of frequently aligned words to have similar
word embeddings. Thus, the use of this regularization term requires first obtaining
word-level alignments from parallel corpora. Zou et al. (2013) use a similar approach,
with a different form for the regularizer and neural network language models as in
(Collobert et al., 2011). In our work, we specifically investigate whether a method
that does not rely on word-level alignments can learn comparably useful multilingual
embeddings in the context of document classification.

Looking more generally at neural networks that learn multilingual representations
of words or phrases, we mention the work of Gao et al. (2014) which showed that
a useful linear mapping between separately trained monolingual skip-gram language
models could be learned. They too however rely on the specification of pairs of words
in the two languages to align. Mikolov et al. (2013) also propose a method for training
a neural network to learn useful representations of phrases, in the context of a
phrase-based translation model. In this case, phrase-level alignments (usually
extracted from word-level alignments) are required. Recently, Hermann & Blunsom
(2014b,a) proposed neural network architectures and a margin-based training objective
that, as in this work, does not rely on word alignments. We will briefly discuss this
work in the experiments section. A tree based bilingual autoencoder with a similar
objective function is also proposed in (Chandar et al., 2014).


7.3 Experiments 

The technique proposed in this work enables us to learn bilingual embeddings which
capture cross-language similarity between words. We propose to evaluate the quality
of these embeddings by using them for the task of cross-language document
classification. We followed closely the setup used by Klementiev et al. (2012) and
compare with their method, for which word representations are publicly available¹.
The setup is as follows. A labeled data set of documents in some language X is
available to train a classifier, however we are interested in classifying documents in
a different language Y at test time. To achieve this, we leverage a bilingual corpus,
which is not labeled with any document-level categories. This bilingual corpus is used
to learn document representations that are coherent between languages X and Y. The
hope is thus that we can successfully apply the classifier trained on document
representations for language X directly to the document representations for language
Y. Following this setup, we performed experiments on 3 data sets of language pairs:
English/German (EN/DE), English/French (EN/FR) and English/Spanish (EN/ES).

For learning the bilingual embeddings, we used sections of the Europarl corpus (Koehn,
2005) which contains roughly 2 million parallel sentences. We considered 3 language
pairs. We used the same pre-processing as used by Klementiev et al. (2012). We
tokenized the sentences using NLTK (Bird Steven & Klein, 2009), removed punctuation
and lowercased all words. We did not remove stopwords.

As for the labeled document classification data sets, they were extracted from
sections of the Reuters RCV1/RCV2 corpora, again for the 3 pairs considered in our
experiments. Following Klementiev et al. (2012), we consider only documents which were
assigned exactly one of the 4 top level categories in the topic hierarchy (CCAT, ECAT,
GCAT and MCAT). These documents are also pre-processed using a similar procedure
as that used for the Europarl corpus. We used the same vocabularies as those used by
Klementiev et al. (2012) (varying in size between 35,000 and 50,000).

Models were trained for up to 20 epochs using the same data as described earlier.
We used mini-batch (of size 20) stochastic gradient descent. All results are for word
embeddings of size D = 40, as in Klementiev et al. (2012). Further, to speed up the
training for CorrNet we merged each 5 adjacent sentence pairs into a single training
instance, as described earlier. For all language pairs, λ was set to 4. The other
hyperparameters were tuned to each task using a training/validation set split of 80%
and 20% and using the performance on the validation set of an averaged perceptron
trained on the smaller training set portion (notice that this corresponds to a
monolingual classification experiment, since the general assumption is that no labeled
data is available in the test set language).

We compare our models with the following approaches: 


• Klementiev et al. (2012): This model uses word embeddings learned by a multitask
  neural network language model with a regularization term that encourages pairs of
  frequently aligned words to have similar word embeddings. From these embeddings,
  document representations are computed as described in Section 7.1.

• MT: Here, test documents are translated to the language of the training documents
  using a standard phrase-based MT system, MOSES², which was trained using default
  parameters and a 5-gram language model on the Europarl corpus (same as the one
  used for inducing our bilingual embeddings).

• Majority Class: Test documents are simply assigned the most frequent class in
  the training set.

¹ http://klementiev.org/data/distrib/
² http://www.statmt.org/moses/


For the EN/DE language pair, we directly report the results from Klementiev et al.
(2012). For the other pairs (not reported in Klementiev et al. (2012)), we used the
embeddings available online and performed the classification experiment ourselves.
Similarly, we generated the MT baseline ourselves.

Table 8 summarizes the results. They were obtained using 1000 RCV training examples.
We report results in both directions, i.e. language X to Y and vice versa. The
best performing method in all the pairs except one is CorrNet. In particular, CorrNet
often outperforms the approach of Klementiev et al. (2012) by a large margin.

In the last row of the table, we also include the results of some recent work by
Hermann & Blunsom (2014b,a). They proposed two neural network architectures for
learning word and document representations using sentence-aligned data only. Instead
of an autoencoder paradigm, they propose a margin-based objective that aims to make
the representations of aligned sentences closer than those of non-aligned sentences.
While their trained embeddings are not publicly available, they report results for the
EN/DE classification experiments, with representations of the same size as here
(D = 40) and trained on 500K EN/DE sentence pairs. Their best model in that setting
reaches accuracies of 83.7% and 71.4% respectively for the EN → DE and DE → EN tasks.
One clear advantage of our model is that unlike their model, it can use additional
monolingual data. Indeed, when we train CorrNet with 500K EN/DE sentence pairs, plus
monolingual RCV documents (which come at no additional cost), we get accuracies of
87.9% (EN → DE) and 76.7% (DE → EN), still improving on their best model. If we
do not use the monolingual data, CorrNet's performance is worse but still competitive
at 86.1% for EN → DE and 68.8% for DE → EN. Finally, without constraining D to
40 (they use 128) and by using additional French data, the best results of Hermann &
Blunsom (2014b) are 88.1% (EN → DE) and 79.1% (DE → EN), the latter being, to our
knowledge, the current state-of-the-art (as reported in the last row of Table 8)³.



                       EN→DE    DE→EN    EN→FR    FR→EN    EN→ES    ES→EN
CorrNet                91.8     74.2     84.6     74.2     49.0     64.4
Klementiev et al.      77.6     71.1     74.5     61.9     31.3     63.0
MT                     68.1     67.4     76.3     71.1     52.0     58.4
Majority Class         46.8     46.8     22.5     25.0     15.3     22.2
Hermann and Blunsom    88.1     79.1     N.A.     N.A.     N.A.     N.A.


Table 8: Cross-lingual classification accuracy for 3 language pairs, with 1000 labeled 
examples. 

We also evaluate the effect of varying the amount of supervised training data for
training the classifier. For brevity, we report only the results for the EN/DE pair,
which are summarized in Figure 5 and Figure 6. We observe that CorrNet clearly
outperforms the other models at almost all data sizes. More importantly, it performs
remarkably well at very low data sizes (100), suggesting it learns very meaningful
embeddings, though the method can still benefit from more labeled data (as in the
DE → EN case).

³ After we published our results in (Chandar et al., 2014), Soyer et al. (2015) have
improved the performance for EN→DE and DE→EN to 92.7% and 82.4% respectively.

oil                        microsoft                        market
en           de            en              de               en             de
oil          öl            microsoft       microsoft        market         markt
supply       boden         cds             cds              markets        marktes
supplies     befindet      insider         warner           single         märkte
gas          gerät         ibm             tageszeitungen   commercial     binnenmarkt
fuel         erdöl         acquisitions    ibm              competition    märkten
mineral      infolge       shareholding    handelskammer    competitive    handel
petroleum    abhängig      warner          exchange         business       öffnung
crude        folge         online          veranstalter     goods          binnenmarktes

Table 9: Example English words along with 8 closest words both in English (en) and
German (de), using the Euclidean distance between the embeddings learned by CorrNet

Table 9 also illustrates the properties captured within and across languages, for the
EN/DE pair. For a few English words, the words with closest word representations (in
Euclidean distance) are shown, for both English and German. We observe that words
that form a translation pair are close, but also that close words within a language are
syntactically/semantically similar as well.

Figure 5: Cross-lingual classification accuracy results for EN → DE

Figure 6: Cross-lingual classification accuracy results for DE → EN


The excellent performance of CorrNet suggests that merging several sentences into
single bags-of-words can still yield good word embeddings. In other words, not only
do we not need to rely on word-level alignments, but exact sentence-level alignment
is also not essential to reach good performance. We experimented with the merging
of 5, 25 and 50 adjacent sentences into a single bag-of-words. Results are shown in
Table 10. They suggest that merging several sentences into single bags-of-words does
not necessarily impact the quality of the word embeddings. Thus they also confirm that
exact sentence-level alignment is not essential to reach good performance.

           # sent.    EN→DE    DE→EN    EN→FR    FR→EN    EN→ES    ES→EN
CorrNet    5          91.75    72.78    84.64    74.2     49.02    64.4
           25         88.0     64.5     78.1     70.02    68.3     54.68
           50         90.2     49.2     82.44    75.5     38.2     67.38


Table 10: Cross-lingual classification accuracy for 3 different pairs of languages, when 
merging the bag-of-words for different numbers of sentences. These results are based 
on 1000 labeled examples. 


8 Transliteration Equivalence 


In the previous section, we showed the application of CorrNet in a cross language
learning setup. In addition to cross language learning, CorrNet can also be used for
matching equivalent items across views. As a case study, we consider the task of
determining transliteration equivalence of named entities wherein, given a word u
written using the script of language X and a word v written using the script of
language Y, the goal is to determine whether u and v are transliterations of each
other. Several approaches have been proposed for this task and the one most related to
our work is an approach which uses CCA for determining transliteration equivalence.

We consider English-Hindi as the language pair for which transliteration equivalence
needs to be determined. For learning common representations we used approximately
15,000 transliteration pairs from the NEWS 2009 English-Hindi training set (Li
et al., 2009). We represent each Hindi word as a bag of 2860 bigram characters. This
forms the first view (x_i). Similarly we represent each English word as a bag of 651
bigram characters. This forms the second view (y_i). Each such pair (x_i, y_i) then
serves as one training instance for the CorrNet.

For testing we consider the standard NEWS 2010 transliteration mining test set
(Kumaran et al., 2010). This test set contains approximately 1000 Wikipedia
English-Hindi title pairs. The original task definition is as follows. For a given
English title containing T1 words and the corresponding Hindi title containing T2
words, identify all pairs which form a transliteration pair. Specifically, for each
title pair, consider all T1 × T2 word pairs and identify the correct transliteration
pairs. In all, the test set contains 5468 word pairs out of which 982 are
transliteration pairs. For every word pair (x_i, y_i) we obtain a 50 dimensional
common representation for x_i and y_i using the trained CorrNet. We then calculate
the correlation between the representations of x_i and y_i. If the correlation is
above a threshold we mark the word pair as equivalent. This threshold is tuned using
an additional 1000 pairs which were provided as training data for the NEWS 2010
transliteration mining task. As seen in Table 11, CorrNet clearly performs better
than the other methods. Note that our aim is not to achieve state of the art
performance on this task but to compare the quality of the shared representations
learned using the different CRL methods considered in this paper.

Model      F1-measure (%)
CCA        49.68
KCCA       42.36
MAE        72.75
CorrNet    81.56


Table 11: Performance on NEWS 2010 En-Hi Transliteration Mining Dataset 
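
The mining procedure described in this section (score every word pair of a title pair,
keep those above a threshold, tune the threshold on held-out pairs) can be sketched as
follows. This is an illustration under assumptions: the correlation between two
representations is taken to be the Pearson correlation across their dimensions, the
threshold is chosen to maximize F1 on the tuning pairs, and all function names are
hypothetical.

```python
import numpy as np

def pair_correlation(hx, hy, eps=1e-12):
    """Pearson correlation between two representation vectors (across dimensions)."""
    hx = np.asarray(hx, dtype=float)
    hy = np.asarray(hy, dtype=float)
    hx = hx - hx.mean()
    hy = hy - hy.mean()
    return float(hx @ hy / (np.linalg.norm(hx) * np.linalg.norm(hy) + eps))

def mine_pairs(title_x, title_y, threshold):
    """Return index pairs (i, j) of the T1 x T2 word pairs scored above the threshold.

    title_x, title_y: lists of common-representation vectors, one per word.
    """
    return [(i, j)
            for i, u in enumerate(title_x)
            for j, v in enumerate(title_y)
            if pair_correlation(u, v) > threshold]

def f1_score(pred, gold):
    """F1 of boolean predictions against boolean gold labels."""
    pred, gold = np.asarray(pred, dtype=bool), np.asarray(gold, dtype=bool)
    tp = int(np.sum(pred & gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / pred.sum(), tp / gold.sum()
    return float(2 * precision * recall / (precision + recall))

def tune_threshold(scores, gold, candidates):
    """Pick the candidate threshold with the best F1 on held-out scored pairs."""
    scores = np.asarray(scores, dtype=float)
    return max(candidates, key=lambda t: f1_score(scores > t, gold))
```

Since only 982 of the 5468 candidate word pairs are positives, a measure such as F1
is more informative than accuracy for choosing the threshold.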


9 Bigram similarity using multilingual word embedding 


In this section, we consider one more dataset/application to compare the performance of
CorrNet with other state of the art methods. Specifically, the task at hand is to
calculate the similarity score between two bigram pairs in English based on their
representations. These bigram representations are calculated from word representations
learnt using English-German word pairs. The motivation here is that the German word
provides some context for disambiguating the English word and hence leads to better
word representations. This task has already been considered in Mitchell & Lapata
(2010), Lu et al. (2015) and Wang et al. (2015). We follow a similar setup to Wang et
al. (2015) and use the same dataset. The English and German words are first
represented using 640-dimensional monolingual word vectors trained via Latent Semantic
Indexing (LSI) on the WMT 2011 monolingual news corpora. We used 36,000 such
English-German monolingual word vector pairs for common representation learning. Each
pair consisting of one English (x_i) and one German (y_i) word thus acts as one
training instance, z_i = (x_i, y_i), for the CorrNet. Once a common representation is
learnt, we project all the English words into this common subspace and use these word
embeddings for computing similarity of bigram pairs in English.

The bigram similarity dataset was initially used in Mitchell & Lapata (2010). We
consider the adjective-noun (AN) and verb-object (VN) subsets of the bigram similarity
dataset. We use the same tuning and test splits of size 649/1,972 for each subset. The
vector representation of a bigram is computed by simply adding the vector
representations of the two words in the bigram. Following previous work, we compute
the cosine similarity between the two vectors of each bigram pair, order the pairs by
similarity, and report the Spearman's correlation (ρ) between the model's ranking and
human rankings.
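
The evaluation pipeline above (additive composition, cosine similarity, rank
correlation) can be sketched as follows. This is a simplified illustration: the
function names are hypothetical, `emb` is assumed to be a mapping from words to NumPy
vectors, and the rank correlation shown does not handle ties, whereas human ratings
typically contain ties and would need a tie-aware implementation.

```python
import numpy as np

def bigram_vector(emb, bigram):
    """Additive composition: sum of the two word vectors of a bigram."""
    w1, w2 = bigram
    return np.asarray(emb[w1], dtype=float) + np.asarray(emb[w2], dtype=float)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def bigram_pair_similarity(emb, bigram_a, bigram_b):
    """Model similarity score for one pair of bigrams."""
    return cosine(bigram_vector(emb, bigram_a), bigram_vector(emb, bigram_b))

def spearman_no_ties(a, b):
    """Spearman's rho as the Pearson correlation of ranks (assumes no tied values)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])
```

Because Spearman's correlation depends only on the ordering of the scores, any
monotone rescaling of the cosine similarities leaves the reported number unchanged.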

Following Wang et al. (2015), we fix the dimensionality of the vectors at L = 384.
Other hyperparameters are tuned using the tuning data. The results are reported in
Table 12, where we compare CorrNet with different methods proposed in Wang et al.
(2015). CorrNet performs better than the previous state-of-the-art (DCCAE) on average
score. Best results are obtained using CorrNet-500-384. This experiment suggests that
apart from multiview applications such as (i) transfer learning, (ii) reconstructing a
missing view and (iii) matching items across views, CorrNet can also be employed to
exploit multiview data to improve the performance of a single view task (such as
monolingual bigram similarity).


Model             AN      VN      Avg.
Baseline (LSI)    45.0    39.1    42.1
CCA               46.6    37.7    42.2
SplitAE           47.0    45.0    46.0
CorrAE            43.0    42.0    42.5
DistAE            43.6    39.4    41.5
FKCCA             46.4    42.9    44.7
NKCCA             44.3    39.5    41.9
DCCA              48.5    42.5    45.5
DCCAE             49.1    43.2    46.2
CorrNet           46.2    47.4    46.8


Table 12: Spearman's correlation for bigram similarity dataset. Results for other models
are taken from Wang et al. (2015).


10 Conclusion and Future Work 

In this paper, we proposed Correlational Neural Networks as a method for learning
common representations for two views of the data. The proposed model has the capability
to reconstruct one view from the other and it ensures that the common representations
learned for the two views are aligned and correlated. Its training procedure is also
scalable. Further, the model can benefit from additional single view data, which is
often available in many real world applications. We employ the common representations
learned using CorrNet for two downstream applications, viz., cross language document
classification and transliteration equivalence detection. For both these tasks we show
that the representations learned using CorrNet perform better than other methods.

We believe it should be possible to extend CorrNet to multiple views. This could
be very useful in applications where varying amounts of data are available in different
views. For example, typically it would be easy to find parallel data for English/German
and English/Hindi, but harder to find parallel data for German/Hindi. If data from all
these languages can be projected to a common subspace then English could act as a
pivot language to facilitate cross language learning between Hindi and German. We
intend to investigate this direction in future work.